Visualize the TensorFlow Speech Commands Dataset

This notebook works with the TensorFlow Speech Commands Dataset, a set of one-second .wav audio files, each containing a single spoken English word. The words come from a small set of commands and are spoken by a variety of different speakers. The dataset was designed for limited-vocabulary speech recognition tasks and can be obtained for free from the IBM Developer Data Asset Exchange.

In this notebook, we will visualize, edit, and compare the sample audio files saved by the previous notebook.

Table of Contents:

  0. Prerequisites
  1. Data Visualization
  2. Edit Audio
  3. Audio Comparison

0. Prerequisites

Before you run this notebook, complete the following steps:

  • Insert a project token
  • Import required packages

Insert a project token

When you import this project from the Watson Studio Gallery, a token should be automatically generated and inserted at the top of this notebook as a code cell such as the one below:

# @hidden_cell
# The project token is an authorization token that is used to access project resources like data sources, connections, and used by platform APIs.
from project_lib import Project
project = Project(project_id='YOUR_PROJECT_ID', project_access_token='YOUR_PROJECT_TOKEN')
pc = project.project_context

If you do not see the cell above, follow these steps to enable the notebook to access the dataset from the project's resources:

  • Click on More -> Insert project token in the top-right menu section


  • This should insert a cell at the top of this notebook similar to the example given above.

    If an error is displayed indicating that no project token is defined, follow these instructions.

  • Run the newly inserted cell before proceeding with the notebook execution below

Import required packages

In [2]:
# Import required libraries
import pandas as pd
import io
# Math
import numpy as np
from scipy.fftpack import fft
from scipy import signal
from scipy.io import wavfile

from sklearn.decomposition import PCA

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import IPython.display as ipd

import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls

%matplotlib inline
import warnings
warnings.filterwarnings("ignore")

Check data assets in this project.

In [3]:
# Extract a sorted list of all assets associated with this project
file_names = sorted([d['name'] for d in project.get_files()])
file_names
Out[3]:
['bird_0a7c2a8d_nohash_0.wav',
 'bird_0b77ee66_nohash_0.wav',
 'bird_0c2ca723_nohash_1.wav',
 'bird_0eb48e10_nohash_1.wav',
 'bird_0fa1e7a9_nohash_0.wav',
 'bird_1d919a90_nohash_2.wav',
 'cat_0ab3b47d_nohash_0.wav',
 'dog_0b09edd3_nohash_1.wav',
 'off_0ab3b47d_nohash_0.wav',
 'on_0a7c2a8d_nohash_0.wav',
 'right_0a7c2a8d_nohash_0.wav',
 'sheila_00f0204f_nohash_1.wav',
 'up_0a7c2a8d_nohash_0.wav',
 'zero_0c40e715_nohash_0.wav']
In [4]:
len([d['name'] for d in project.get_files()])
Out[4]:
14

1. Data Visualization

What is sample rate?

Sample rate is how frequently samples are taken. It is measured in samples per second and is usually expressed in kilohertz (kHz), a unit meaning 1,000 times per second. Audio CDs, for example, have a sample rate of 44.1 kHz, which means the analog signal is sampled 44,100 times per second. If the audio sample rate is 16 kHz, the analog signal is sampled 16,000 times per second.

In [5]:
sample_rate, samples = wavfile.read(project.get_file('bird_0a7c2a8d_nohash_0.wav'))
print('Audio Sample Rate', sample_rate)
Audio Sample Rate 16000
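Since each clip contains roughly one second of audio, the number of samples should be close to the sample rate. A quick check using the samples array loaded above:

# Duration in seconds = number of samples / samples per second
print('Number of samples:', len(samples))
print('Duration (s):', len(samples) / sample_rate)  # ~1.0 for a one-second clip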

Let's create a function that calculates the spectrogram of the raw audio files. A spectrogram is a visual representation of the spectrum of frequencies of a signal as it varies with time. We will also use a log scale for the spectrogram values, since log-scaled values are much easier to plot and compare. Additionally, we add a small offset (eps) before taking the log so that zero values do not cause problems.

The inputs of this function are the samples extracted from the wav file, the sample rate, the window size in milliseconds, the step (stride) size in milliseconds, and a small offset. The outputs follow the SciPy documentation: the log_specgram function returns three values, an array of sample frequencies, an array of segment times, and the log-scaled spectrogram of the input signal.

We rescale the spectrogram with the log function for the sake of calculation and visualization. Since there are far more large values than small values, we don't want the large ones to dominate the computation. Taking the log compresses the differences between large and small values while still preserving their order, as the small example below illustrates.
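As a small standalone numeric illustration of this compression (the values below are arbitrary, not taken from the dataset):

# Values spanning nine orders of magnitude are squeezed into roughly [-7, 14],
# and their order is preserved.
vals = np.array([1e-3, 1.0, 1e3, 1e6])
print(np.log(vals + 1e-10))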

Reference: log_specgram method

In [6]:
def log_specgram(audio, sample_rate, window_size=20,
                 step_size=10, eps=1e-10):
    nperseg = int(round(window_size * sample_rate / 1e3))
    noverlap = int(round(step_size * sample_rate / 1e3))
    freqs, times, spec = signal.spectrogram(audio,
                                    fs=sample_rate,
                                    window='hann',
                                    nperseg=nperseg,
                                    noverlap=noverlap,
                                    detrend=False)
    return freqs, times, np.log(spec.T.astype(np.float32) + eps)

freqs, times, spectrogram = log_specgram(samples, sample_rate)
data = [go.Surface(z=spectrogram.T)]
layout = go.Layout(
    title='Spectrogram of "bird" in 3d',
    scene=dict(
        yaxis=dict(title='Frequency'),
        xaxis=dict(title='Time'),
        zaxis=dict(title='Log amplitude'),
    ),
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig)

What are the amplitude and frequency of an audio sample?

There are two main properties of a regular vibration - the amplitude and the frequency - which affect the way it sounds.

Amplitude is the size of the vibration, which determines how loud the sound is. The larger the size of vibrations, the louder the sound. Amplitude is important when balancing and controlling the loudness of sounds, such as with the volume control on your computer.

Frequency is the speed of the vibration, which determines the pitch of the sound. The faster the speed of the vibrations, the higher the tone.

Reference: Amplitude and Frequency
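To make these two properties concrete, here is a small illustrative sketch (not part of the dataset workflow; the tone parameters are arbitrary) that plots two synthetic sine waves, where amplitude sets the height of the waveform and frequency sets how many cycles occur per second:

# Illustrative only: two synthetic pure tones with different amplitude and frequency
import numpy as np
import matplotlib.pyplot as plt

def pure_tone(frequency, amplitude, duration=0.01, sample_rate=16000):
    t = np.linspace(0, duration, int(sample_rate * duration), endpoint=False)
    return t, amplitude * np.sin(2 * np.pi * frequency * t)

t, quiet_low = pure_tone(frequency=220.0, amplitude=0.3)   # quieter, lower pitch
t, loud_high = pure_tone(frequency=880.0, amplitude=0.9)   # louder, higher pitch
plt.figure(figsize=(10, 3))
plt.plot(t, quiet_low, label='220 Hz, amplitude 0.3')
plt.plot(t, loud_high, label='880 Hz, amplitude 0.9')
plt.xlabel('Time (s)')
plt.ylabel('Amplitude')
plt.legend()
plt.show()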

In [7]:
# Visualize an audio clip
fig = plt.figure(figsize=(14, 8))
ax1 = fig.add_subplot(111)
ax1.set_title('Raw wave of bird')
ax1.set_xlabel('Time (s)')
ax1.set_ylabel('Amplitude')
ax1.plot(np.linspace(0, len(samples) / sample_rate, num=len(samples)), samples)
# Create a spectrogram of audio clip
fig = plt.figure(figsize=(14, 8))
ax2 = fig.add_subplot(111)
ax2.imshow(spectrogram.T, aspect='auto', origin='lower', 
           extent=[times.min(), times.max(), freqs.min(), freqs.max()])
ax2.set_yticks(freqs[::16])
ax2.set_xticks(times[::16])
ax2.set_title('Spectrogram of bird')
ax2.set_ylabel('Freqs in Hz')
ax2.set_xlabel('Seconds')
Out[7]:
Text(0.5, 0, 'Seconds')

In the plots above, the amplitude increases significantly after about 0.5 seconds, which means the main sound starts around that point. If you listen to the audio sample, you will notice a silence gap at the beginning. Let's listen to the audio in the next section.

2. Edit Audio

In this section, we will edit the "bird" audio sample by removing silence and by resampling.

2.1 Silence removal

In the previous visualization, we saw that there is a silence gap at the beginning of the audio sample. We want to shorten the sound file by cutting out the silent part. Let's listen to the original "bird" sound file first.

In [8]:
ipd.Audio(samples, rate=sample_rate)
Out[8]:

Let's cut a bit of the file at the beginning and at the end, and listen to it again. Based on the amplitude plot above, the spoken word lies roughly between 0.44 and 0.88 seconds. Since 7000 / 16000 ≈ 0.44 and 14000 / 16000 = 0.875, we cut the audio sample down to samples 7000 through 14000.

In [9]:
samples_cut = samples[7000:14000]
ipd.Audio(samples_cut, rate=sample_rate)
Out[9]:

Listening to the trimmed audio, we can confirm that the entire word can still be heard.

Next, we want to visualize the trimmed audio. VAD (Voice Activity Detection) is a useful technique here: it detects the presence or absence of human speech in a signal. Even though the words are short, there is still a lot of silence in them, and the detection can be used to trigger further processing.

A good VAD can reduce the size of the training data considerably, which accelerates training significantly. Feel free to explore further. Reference: Voice Activity Detection

It is impractical to trim all the files manually based on a simple plot, but we can use the webrtcvad package to get a good VAD, as sketched below.
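The following is a minimal, illustrative sketch of frame-level VAD with webrtcvad (an assumption here: the webrtcvad package is installed, e.g. via pip install webrtcvad; it expects 16-bit mono PCM at 8, 16, 32, or 48 kHz and frames of 10, 20, or 30 ms). It only flags which frames contain speech; building a full trimming pipeline on top of it is left as an exercise.

# Minimal frame-level VAD sketch (assumes `pip install webrtcvad`)
import webrtcvad
import numpy as np

def speech_frames(samples, sample_rate, frame_ms=30, aggressiveness=2):
    """Return one boolean per frame, True where webrtcvad detects speech."""
    vad = webrtcvad.Vad(aggressiveness)           # 0 (least) .. 3 (most aggressive)
    frame_len = int(sample_rate * frame_ms / 1000)
    pcm = samples.astype(np.int16).tobytes()      # webrtcvad expects 16-bit PCM bytes
    flags = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = pcm[start * 2:(start + frame_len) * 2]  # 2 bytes per int16 sample
        flags.append(vad.is_speech(frame, sample_rate))
    return flags

# Example: mark which 30 ms frames of the original "bird" clip contain speech
# print(speech_frames(samples, sample_rate))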

Let's plot the audio sample, together with a guessed alignment of the 'b', 'ir', and 'd' graphemes.

In [10]:
freqs, times, spectrogram_cut = log_specgram(samples_cut, sample_rate)

fig = plt.figure(figsize=(14, 4))
ax1 = fig.add_subplot(111)
ax1.set_title('Raw Wave of bird sample')
ax1.set_ylabel('Amplitude')
ax1.plot(samples_cut)

fig = plt.figure(figsize=(14, 4))
ax2 = fig.add_subplot(111)
ax2.set_title('Spectrogram of bird sample')
ax2.set_ylabel('Freqs in Hz')
ax2.set_xlabel('Seconds')
ax2.imshow(spectrogram_cut.T, aspect='auto', origin='lower', 
           extent=[times.min(), times.max(), freqs.min(), freqs.max()])
ax2.set_yticks(freqs[::16])
ax2.set_xticks(times[::16])
ax2.text(0.075, 1000, 'B', fontsize=18)
ax2.text(0.16, 1000, 'IR', fontsize=18)
ax2.text(0.27, 1000, 'D', fontsize=18)

xcoords = [0.05, 0.1, 0.23, 0.312]
for xc in xcoords:
    ax1.axvline(x=xc*16000, c='r')
    ax2.axvline(x=xc, c='r')

2.2 Resampling - dimensionality reduction

Resampling recordings is another way to reduce the dimensionality of the data.

Most speech-related frequencies lie within a small band. GSM (2G wireless communication) audio is sampled at 8,000 Hz, and people can still understand one another when talking on the telephone.

Resampling the dataset from 16 kHz to 4 kHz reduces the size of the data by a factor of four. We will perform this resampling in this section.

Before resampling, let's first define a helper that calculates the FFT (Fast Fourier Transform), so that we can compare the frequency content of the original and resampled signals.

What is Fast Fourier Transform?

A fast Fourier transform (FFT) is an algorithm that computes the discrete Fourier transform (DFT) of a sequence, or its inverse (IDFT). Reference: FFT Wiki

The human ear processes audio in a way that resembles a Fourier transform. Our ears perform a transform by converting sound, the waves of pressure traveling over time through the atmosphere, into a spectrum: a description of the sound as a series of volumes at distinct pitches. The brain then turns this information into perceived sound.

The Fast Fourier Transform (FFT) is calculated below:

Reference: Fast Fourier Transform method

In [11]:
def custom_fft(y, fs):
    T = 1.0 / fs                       # sampling period
    N = y.shape[0]                     # number of samples
    yf = fft(y)
    xf = np.linspace(0.0, 1.0/(2.0*T), N//2)
    vals = 2.0/N * np.abs(yf[0:N//2])  # the FFT of a real signal is symmetric, so we take just the first half
    return xf, vals

Let's read one audio sample, resample it, and listen. We can also compare the FFTs. Notice that there is almost no information above 4,000 Hz in the original signal.

In [12]:
# Set new sample rate
new_sample_rate = 4000
# Read in the original bird audio sample, then resample it to the new rate
sample_rate, samples = wavfile.read(project.get_file('bird_0a7c2a8d_nohash_0.wav'))
resampled = signal.resample(samples, int(new_sample_rate/sample_rate * samples.shape[0]))
In [13]:
# Play resampled audio
ipd.Audio(resampled, rate=new_sample_rate)
Out[13]:

Now, we want to visualize and compare the FFT graph of both original and resampled audio file.

In [14]:
xf, vals = custom_fft(samples, sample_rate)
plt.figure(figsize=(12, 4))
plt.title('FFT of recording sampled with ' + str(sample_rate) + ' Hz')
plt.xlim(left=0, right=8000)
plt.plot(xf, vals)
plt.xlabel('Frequency')
plt.grid()
plt.show()
xf, vals = custom_fft(resampled, new_sample_rate)
plt.figure(figsize=(12, 4))
plt.title('FFT of recording sampled with ' + str(new_sample_rate) + ' Hz')
plt.xlim(left=0, right=8000)
plt.plot(xf, vals)
plt.xlabel('Frequency')
plt.grid()
plt.show()

From the FFT graphs, the FFT of the resampled 4,000 Hz audio is truncated at 2,000 Hz, the Nyquist frequency of the new sample rate. Any content above 2,000 Hz is therefore not present in the resampled audio, which explains why the resampled clip sounds muffled.
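The cut-off follows directly from the Nyquist relationship: a signal sampled at rate f can only represent frequencies up to f/2. A quick check using the sample rates defined above:

# Highest representable frequency (Nyquist frequency) = sample rate / 2
print('Original Nyquist frequency :', sample_rate / 2)      # 16000 / 2 = 8000 Hz
print('Resampled Nyquist frequency:', new_sample_rate / 2)  # 4000 / 2 = 2000 Hz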

3. Audio Comparison

In this section, we want to compare the differences between audio files.

First, let's visualize all audio files which have distinct labels.

In [15]:
file_name = ['bird_0a7c2a8d_nohash_0.wav', 'cat_0ab3b47d_nohash_0.wav', 'dog_0b09edd3_nohash_1.wav',
              'off_0ab3b47d_nohash_0.wav', 'on_0a7c2a8d_nohash_0.wav', 'right_0a7c2a8d_nohash_0.wav',
              'sheila_00f0204f_nohash_1.wav', 'up_0a7c2a8d_nohash_0.wav', 'zero_0c40e715_nohash_0.wav']
fig = plt.figure(figsize=(8,8))
fig.suptitle('Spectrogram', fontsize=16)

# for each of the samples
for i, filepath in enumerate(file_name):
    # Make subplots
    plt.subplot(3,3,i+1)
    
    # pull the labels
    label = filepath.split('_')[0]
    plt.title(label)
    
    # create spectrogram
    sample_rate, samples  = wavfile.read(project.get_file(filepath))
    _, _, spectrogram = log_specgram(samples, sample_rate)
    
    plt.imshow(spectrogram.T, aspect='auto', origin='lower')
    # set no axis label
    plt.axis('off')

# Plot the raw audio waveforms
fig = plt.figure(figsize=(8,13))
fig.suptitle('Raw Audio', fontsize=16)
for i, filepath in enumerate(file_name):
    plt.subplot(10,1,i+1)
    sample_rate, samples  = wavfile.read(project.get_file(filepath))
    plt.title(filepath.split('_')[0])
    plt.axis('off')
    plt.plot(samples)

Next, let's visualize audio files that have the same label.

In [16]:
# Define bird files
file_name = [f for f in file_names if 'bird' in f]
fig = plt.figure(figsize=(8,8))
fig.suptitle('Spectrogram', fontsize=16)

for i, filepath in enumerate(file_name):
    # Make subplots
    plt.subplot(3,3,i+1)
    
    # pull the labels
    label = filepath.split('_')[0]
    plt.title(label)
    
    # create spectrogram
    sample_rate, samples  = wavfile.read(project.get_file(filepath))
    _, _, spectrogram = log_specgram(samples, sample_rate)
    
    plt.imshow(spectrogram.T, aspect='auto', origin='lower')
    plt.axis('off')
    
fig = plt.figure(figsize=(8,13))
fig.suptitle('Raw Audio', fontsize=16)
for i, filepath in enumerate(file_name):
    plt.subplot(10,1,i+1)
    sample_rate, samples  = wavfile.read(project.get_file(filepath))
    plt.title(filepath.split('_')[0])
    plt.axis('off')
    plt.plot(samples)

Anomaly detection by PCA

Now, we want to check if any recordings are "outliers" which are different from all others. We can lower the dimensionality of the dataset and interactively check for any anomaly. Let's use Principal Component Analysis (PCA) for dimensionality reduction.

So, what is PCA? PCA is defined as an orthogonal linear transformation that transforms the data to a new coordinate system such that the greatest variance by some scalar projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on. In simple terms, given a dataset with a number of features, PCA finds a way to approximate the original features using fewer, more effective features that are statistically similar representations of the originals.

Reference: Principal Component Analysis (PCA)

To use PCA in this scenario, we pad each audio sample with zeros (if needed) so that every clip is exactly one second long, compute its FFT, and stack the resulting magnitude spectra into one matrix. We then normalize the matrix and run PCA to reduce each clip to three dimensions. While we lose some of the information in the original files, we compress each file into a three-dimensional data point that we can compare.

In [17]:
ffts, audio_names = [], []
for filepath in file_names:
    sample_rate, samples  = wavfile.read(project.get_file(filepath))
    if samples.shape[0] != sample_rate:
        samples = np.append(samples, np.zeros((sample_rate - samples.shape[0], )))
    x, values = custom_fft(samples, sample_rate)
    ffts.append(values)
    audio_names.append(filepath)
# Set ffts from list type to array
ffts = np.array(ffts)

# Normalization: (Datapoint - mean)/standard deviation
ffts = (ffts - np.mean(ffts)) / np.std(ffts)

# Reduce the dimension to 3D
pca = PCA(n_components=3)
ffts = pca.fit_transform(ffts)

def interactive_3d_plot(data, names):
    scatt = go.Scatter3d(x=data[:, 0], y=data[:, 1], z=data[:, 2], mode='markers', text=names)
    layout = go.Layout(title="Anomaly detection of Audio Samples")
    figure = go.Figure(data=[scatt], layout=layout)
    py.iplot(figure)
    
interactive_3d_plot(ffts, audio_names)

From the 3-D graph, we can see that a couple of audio files are somewhat different from all the others. This graph only illustrates the concept, using just 14 data files. The result here might not be very informative, so feel free to rerun this analysis with 100, 1,000, or even more data files.
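As a quick sanity check on this reduction (using the pca object fitted above), we can also look at how much of the total variance the three principal components retain:

# Fraction of the variance in the normalized FFT features captured by each component
print(pca.explained_variance_ratio_)
print('Total variance explained:', pca.explained_variance_ratio_.sum())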

Next steps

  • Close this notebook.

Authors

This notebook was created by the Center for Open-Source Data & AI Technologies.

Copyright © 2020 IBM. This notebook and its source code are released under the terms of the MIT License.

Love this notebook? Don't have an account yet?
Share it with your colleagues and help them discover the power of Watson Studio! Sign Up